Abstract. While many existing formal concept analysis algorithms are
efficient, they are typically unsuitable for distributed implementation. 
Taking the MapReduce (MR) framework as our inspiration we intro- 
duce a distributed approach for performing formal concept mining. Our 
method has its novelty in that we use a light-weight MapReduce run- 
time called Twister which is better suited to iterative algorithms than 
recent distributed approaches. First, we describe the theoretical foun- 
dations underpinning our distributed formal concept analysis approach. 
Second, we provide a representative exemplar of how a classic central- 
ized algorithm can be implemented in a distributed fashion using our 
methodology: we modify Ganter's classic algorithm by introducing a 
family of MR* algorithms, namely MRGanter and MRGanter+ where 
the prefix denotes the algorithm's lineage. To evaluate the factors that 
impact distributed algorithm performance, we compare our MR* algo- 
rithms with the state-of-the-art. Experiments conducted on real datasets 
demonstrate that MRGanter+ is efficient, scalable and an appealing al- 
gorithm for distributed problems. 

Keywords: Formal Concept Analysis; Distributed Mining; MapReduce 



1 Introduction 

Formal Concept Analysis (FCA), pioneered in the 80's by Wille [T], is a method 
for extracting formal concepts (natural clusters of objects and attributes) from
binary object-attribute relational data. FCA has great appeal in the context of 
knowledge discovery [5] , information retrieval [3] and social networking analysis 
applications [1] because arranging data as a concept lattice yields a powerful and 
intuitive representation of the dataset [115] . 

FCA relies on a closure operation, which searches for implications between
attributes (objects) [fr. According to this property, new formal concepts may
be extracted iteratively by mapping a set of attributes (objects). Existing FCA
algorithms perform this procedure iteratively and need to access the dataset in
each iteration, so they are appropriate only for small, centralized datasets.
The recent explosion in dataset sizes, privacy protection concerns, and the
distributed nature of the systems that collect this data suggest that efficient
distributed FCA algorithms are required. In this paper we introduce a
distributed FCA approach based on a light-weight MapReduce runtime called
Twister [7], which is suited to iterative algorithms, scales well and reduces
communication overhead.

1.1 Related Work 

Some well-known algorithms for performing FCA include Ganter's algorithm [8],
Lindig's algorithm [9] and CloseByOne [10,11], and their variants [12,13].
Ganter introduces lectic ordering so that not all potential attribute subsets
of the data have to be scanned when performing FCA. Ganter's algorithm computes
concepts iteratively based on the previous concept without incurring
exponential memory requirements. In contrast, CloseByOne produces many concepts
in each iteration. Bordat's algorithm [14] runs in almost the same amount of
time as Ganter's algorithm; however, it takes a local concept generation
approach. Bordat's algorithm introduces a data structure to store previously
found concepts, which results in considerable time savings. Berry proposes an
efficient algorithm based on Bordat's approach, which requires a data structure
of exponential size [15]. A comparison of the theoretical and empirical
complexity of many well-known FCA algorithms is given in [16]. In addition,
some useful principles for evaluating algorithm performance for sparse and
dense data are suggested by Kuznetsov and Obiedkov; we consider data density
when evaluating our approach.

The main disadvantage of the batch algorithms discussed above is that they
require that the entire lattice is reconstructed from scratch if the database
changes. Incremental algorithms address this problem by updating the lattice
structure when a new object is added to the database. Incremental approaches
have been made popular by Norris [17], Dowling [18], Godin et al. [19],
Carpineto and Romano [20], Valtchev et al. [21] and Yu et al. [22]. In recent
years, to reduce concept enumeration time, some parallel and distributed
algorithms have been proposed. Krajca et al. proposed a parallel version based
on CloseByOne [13]. The first distributed algorithm [23] was developed by
Krajca and Vychodil in 2009 using the MapReduce framework [24]. In order to
encourage more widespread usage of FCA, beyond the traditional FCA audience, we
propose the development and implementation of efficient, distributed FCA
algorithms. Distributed FCA is particularly appealing to practitioners because
distributed approaches can potentially take advantage of cloud infrastructures
to reduce enumeration time.

1.2 Contributions 

We utilize the MapReduce framework in this paper to execute distributed
algorithms on different nodes. Since its inception by Google in 2004, several
implementations of MapReduce have been developed by a number of companies and
organizations, such as Hadoop MapReduce by Apache and Twister Iterative
MapReduce. A crucial distinction between the present paper and the work of
Krajca and Vychodil [23] is that we use a Twister implementation of MapReduce.
Twister supports iterative algorithms [7]: we leverage this property to reduce
the computation time of our distributed FCA algorithms. In contrast, the Hadoop
architecture is designed for performing single-step MapReduce. We implement
new distributed versions (MRGanter and MRGanter+) of Ganter's algorithm
and empirically evaluate their performance. In order to provide an established
and credible benchmark under equivalent experimental conditions, MRCbo, the
distributed version of CloseByOne, is implemented as well using Twister.

Table 1. The symbol x indicates that an object has the corresponding attribute.

This paper is organized as follows. Section 2 gives an overview of Formal
Concept Analysis and Ganter's algorithm. The theoretical underpinnings for
implementing FCA using distributed datasets are described in Section 3 to
support our approach. Our main contribution is a set of Twister-based
distributed versions of Ganter's algorithm. Section 4 presents an
implementation overview and comparison of Twister and Hadoop MapReduce.
Empirical evaluation of the algorithms proposed in this paper is performed
using real datasets from the UCI KDD machine learning repository, and
experimental results are discussed in Section 5. In summary, MRGanter+ performs
favourably in comparison to centralized versions.

2 Formal Concept Analysis 

We continue by introducing the notational conventions used in the sequel. Let
$O$ and $P$ denote a finite set of objects and attributes respectively. The
data ensemble, $S$, may be arranged in Boolean matrix form as follows: the
objects and attributes are listed along the rows and columns of the matrix
respectively; the symbol x is entered in a row-column position to denote that
an object has that attribute; an empty entry denotes that the object does not
have that attribute. Formally, this matrix describes the binary relation
$I \subseteq O \times P$ between the sets $O$ and $P$. An object $t \in O$ has
attribute $p \in P$ if $(t, p) \in I$. The triple $(O, P, I)$ is called a
formal context. For example, in Table 1, $O = \{1, 2, 3, 4, 5, 6\}$ and
$P = \{a, b, c, d, e, f, g\}$, thus object $\{2\}$ has attributes
$\{a, c, e, g\}$.

We define a derivation operator on $X$ and $Y$, where $X \subseteq O$ and
$Y \subseteq P$, as:

$$X' = \{\, p \in P \mid \forall t \in X : (t, p) \in I \,\} \qquad (1)$$
$$Y' = \{\, t \in O \mid \forall p \in Y : (t, p) \in I \,\}. \qquad (2)$$

The operation $X'$ generates the set of attributes which are common to all
objects in $X$. Similarly, $Y'$ generates the set of all objects which have all
attributes in $Y$. A pair $(X, Y)$ is called a formal concept of $(O, P, I)$ if
and only if $X \subseteq O$, $Y \subseteq P$, $X' = Y$, and $Y' = X$, where $X$
and $Y$ are called its extent and intent. The crucial property of a formal
concept is that it is fixed by the mappings $X \mapsto X''$ and
$Y \mapsto Y''$, commonly known as closure operators: $X'' = X$ and $Y'' = Y$.
The closure operator is used to calculate the extent and intent that form a
formal concept.
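
The derivation and closure operators can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Java implementation: the names `CONTEXT`, `common_attributes`, `common_objects` and `closure` are hypothetical, and the six object rows are an assumption, reconstructed from the single-object concepts listed in Table 2 because the body of Table 1 is not reproduced in this text.

```python
# Toy formal context: object -> set of attributes. The rows are an assumption,
# reconstructed from the single-object concepts listed in Table 2.
CONTEXT = {
    1: frozenset("abdf"),
    2: frozenset("aceg"),
    3: frozenset("bcdfg"),
    4: frozenset("bde"),
    5: frozenset("adef"),
    6: frozenset("bcfg"),
}
ATTRIBUTES = frozenset("abcdefg")


def common_attributes(objects):
    """X': the attributes shared by every object in X (all of P if X is empty)."""
    shared = ATTRIBUTES
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def common_objects(attrs):
    """Y': the objects that possess every attribute in Y."""
    wanted = frozenset(attrs)
    return frozenset(obj for obj, row in CONTEXT.items() if wanted <= row)


def closure(attrs):
    """Y'': the closure of an attribute set, i.e. the intent it generates."""
    return common_attributes(common_objects(attrs))


if __name__ == "__main__":
    print(sorted(common_attributes({2})))  # attributes of object 2
    print(sorted(common_objects("df")))    # objects having both d and f
    print(sorted(closure("ad")))           # intent generated by {a, d}
```

Under these assumed rows, the pair (`common_objects("ad")`, `closure("ad")`) is the concept ({1, 5}, {a, d, f}) that appears as F_17 in Table 2.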

In the following sections we describe established algorithms for concept min- 
ing, namely Ganter's algorithm (also known as NextClosure) and CloseByOne. 
We then introduce our distributed extensions of these approaches. 

2.1 Ganter: Iterative Closure Mining Algorithm 

The NextClosure algorithm describes a method for generating new closures which
guarantees every closure is enumerated once. Closures are generated iteratively
using a pre-defined order, namely lectic ordering. The set of all formal
concepts is denoted by $\mathcal{F}$. Let us arrange the elements of
$P = \{p_1, \ldots, p_i, \ldots, p_m\}$ in an arbitrary linear order
$p_1 < p_2 < \cdots < p_i < \cdots < p_m$, where $m$ is the cardinality of the
attribute set, $P$. The decision to use lectic ordering dictates that any
arbitrarily chosen subset of $P$ is also ordered according to the lectic
ordering which was defined ab initio. Given two subsets $Y_1, Y_2 \subseteq P$,
$Y_1$ is lectically smaller than $Y_2$ if the smallest element in which $Y_1$
and $Y_2$ differ belongs to $Y_2$.

$$Y_1 < Y_2 :\Leftrightarrow \exists_{p_i} \big( p_i \in Y_2,\ p_i \notin Y_1,\ \forall_{p_j < p_i} (p_j \in Y_1 \Leftrightarrow p_j \in Y_2) \big). \qquad (3)$$

NextClosure uses Eqn. (3) as a feasibility condition for accepting new
candidate formal concepts. Typically this difference in set membership is made
more explicit by denoting the smallest element, $p_i$, in which the sets $Y_1$
and $Y_2$ differ.

$$Y_1 <_{p_i} Y_2 :\Leftrightarrow \big( p_i \in Y_2,\ p_i \notin Y_1,\ \forall_{p_j < p_i} (p_j \in Y_1 \Leftrightarrow p_j \in Y_2) \big). \qquad (4)$$

To fix ideas, if the order of $P = \{a, b, c, d, e, f, g\}$ is defined as
$a < b < c < d < e < f < g$, and two subsets of $P$, or itemsets,
$Y_1 = \{a, c, e, g\}$ and $Y_2 = \{a, b, e, g\}$ are examined, then
$Y_1 < Y_2$ because the smallest element in which the two sets differ is $b$
and this element belongs to $Y_2$.
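
The lectic comparison amounts to a single scan over the attributes in their linear order. The sketch below is illustrative only; `lectic_witness` and `lectic_less` are hypothetical helper names, and the order a < b < ... < g is the one assumed in the example above.

```python
ORDER = "abcdefg"  # the linear order a < b < ... < g used in the text


def lectic_witness(y1, y2, order=ORDER):
    """Return p_i such that Y1 <_{p_i} Y2, or None if Y1 is not lectically smaller.

    The attributes are scanned from smallest to largest; the first attribute on
    which the two sets disagree decides the comparison (Eqns. 3 and 4).
    """
    for p in order:
        if (p in y1) != (p in y2):
            return p if p in y2 else None
    return None  # identical sets are not strictly smaller


def lectic_less(y1, y2, order=ORDER):
    """Y1 < Y2 in the lectic order (Eqn. 3)."""
    return lectic_witness(y1, y2, order) is not None


if __name__ == "__main__":
    y1, y2 = set("aceg"), set("abeg")
    print(lectic_witness(y1, y2))  # the attribute that decides Y1 < Y2
    print(lectic_less(y2, y1))
```

Returning the witness attribute rather than a plain boolean keeps Eqn. (3) and Eqn. (4) in one function: Eqn. (4) simply fixes which attribute the witness must be.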

In general, each subset $Y \subseteq P$ may yield a closure,
$Y'' \subseteq P$. The NextClosure algorithm attempts to find all closures
systematically by exploiting lectic ordering. The generative operation is the
$\oplus$-operation: a new intent is generated by applying $\oplus$ on an
existing intent and an attribute. Let the ordering of $P$ be
$p_1 < p_2 < \cdots < p_i < \cdots < p_m$, and consider the subset
$Y \subseteq P$. The $\oplus$-operator is defined as:

$$Y \oplus p_i := \big( (Y \cap \{p_1, \ldots, p_{i-1}\}) \cup \{p_i\} \big)'', \quad \text{where } Y \subseteq P \text{ and } p_i \in P. \qquad (5)$$

NextClosure then compares the new candidate formal concept with the previous
concept. If the condition in Eqn. (4) is satisfied the candidate concept
produced by Eqn. (5) is kept.

The $\oplus$-operator in Eqn. (5) consists of intersection, union and closure
operations. Lectic ordering and the associated complexity of these operations
explain why NextClosure's ordered approach incurs high computational expense,
and consequently why the largest dataset size NextClosure can practically
process is relatively small.
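
To make the three steps of Eqn. (5) concrete, the following sketch applies the operator to a toy context. It is an illustration under stated assumptions: `oplus` and `closure` are hypothetical names, and the object rows are reconstructed from the single-object concepts of Table 2 rather than taken from Table 1.

```python
ORDER = "abcdefg"  # linear order on the attributes, a < b < ... < g

# Assumed object rows, reconstructed from the single-object concepts in Table 2.
CONTEXT = {
    1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg"),
    4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg"),
}


def closure(attrs):
    """Y'': intersect the rows of all objects that contain every attribute of Y."""
    attrs = frozenset(attrs)
    result = frozenset(ORDER)
    for row in CONTEXT.values():
        if attrs <= row:
            result &= row
    return result


def oplus(y, p):
    """Y (+) p: keep the attributes of Y smaller than p, add p, then close (Eqn. 5)."""
    lower = {q for q in y if ORDER.index(q) < ORDER.index(p)}  # intersection step
    return closure(lower | {p})                                # union and closure


if __name__ == "__main__":
    print(sorted(oplus("adf", "e")))
    print(sorted(oplus("adf", "c")))
```

Each call scans every object row once, which gives a feel for why repeated applications of the operator dominate the running time on large contexts.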

Table 2. Formal concepts mined from Table 1 including empty concepts.

F_1: ({1,2,3,4,5,6}, {})     F_8:  ({1,3,4,6}, {b})        F_15: ({1,2,5}, {a})
F_2: ({1,3,5,6}, {f})        F_9:  ({1,3,6}, {b,f})        F_16: ({2,5}, {a,e})
F_3: ({2,4,5}, {e})          F_10: ({1,3,4}, {b,d})        F_17: ({1,5}, {a,d,f})
F_4: ({1,3,4,5}, {d})        F_11: ({1,3}, {b,d,f})        F_18: ({5}, {a,d,e,f})
F_5: ({1,3,5}, {d,f})        F_12: ({4}, {b,d,e})          F_19: ({2}, {a,c,e,g})
F_6: ({4,5}, {d,e})          F_13: ({3,6}, {b,c,f,g})      F_20: ({1}, {a,b,d,f})
F_7: ({2,3,6}, {c,g})        F_14: ({3}, {b,c,d,f,g})      F_21: ({}, {a,b,c,d,e,f,g})



Algorithm 1 AllClosure

Input: ∅: null attribute set.
Output: F: formal concept set.
1: Y ← ∅''; F ← {Y};
2: while Y is not the last closure do
3:   Y ← NextClosure();
4:   F ← F ∪ {Y};
5: end while
6: return F



Algorithm 2 NextClosure

Input: O, P, I, Y: formal context & current intent.
Output: Y.
1: for p_i from p_m down to p_1 do
2:   if p_i ∉ Y then
3:     candidate ← Y ⊕ p_i;
4:     if Y <_{p_i} candidate then
5:       Y ← candidate;
6:       break;
7:     end if
8:   end if
9: end for
10: return Y



Example 1 Consider the formal context in Table 1. Assume we have a concept
$(\{1, 5\}, \{a, d, f\})$. We take the attribute set, $Y = \{a, d, f\}$, and
calculate $Y \oplus e$. First, we compute
$\{a, d, f\} \cap \{a, b, c, d\} = \{a, d\}$, then we append $e$ and generate
$\{a, d\} \cup \{e\} = \{a, d, e\}$. Performing
$\{a, d, f\} \oplus e = \{a, d, e\}''$ yields the set $\{a, d, e, f\}$. To
demonstrate the role of lectic ordering, we compute
$Y \oplus c = \{a, c\}'' = \{a, c, e, g\}$. According to the feasibility
condition in Eqn. (4), $\{a, d, e, f\} <_c \{a, c, e, g\}$. Thus, the set
$\{a, c, e, g\}$ is added to the concept lattice, $\mathcal{F}$. By repeating
this process, NextClosure determines that there are 21 formal concepts in the
concept lattice representation of the formal context in Table 1. The set of
concepts, $\mathcal{F}$, is listed in Table 2.
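
The enumeration in Example 1 can be replayed with a compact, centralized NextClosure sketch. This is not the authors' code: all function names are hypothetical, and the object rows are an assumption reconstructed from the single-object concepts of Table 2.

```python
ORDER = "abcdefg"  # a < b < ... < g

# Assumed object rows, reconstructed from the single-object concepts in Table 2.
CONTEXT = {
    1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg"),
    4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg"),
}


def closure(attrs):
    """Y'': intersect the rows of all objects containing every attribute of Y."""
    attrs = frozenset(attrs)
    result = frozenset(ORDER)
    for row in CONTEXT.values():
        if attrs <= row:
            result &= row
    return result


def oplus(y, p):
    """Y (+) p as in Eqn. (5)."""
    lower = {q for q in y if ORDER.index(q) < ORDER.index(p)}
    return closure(lower | {p})


def next_closure(y):
    """Lectically next closure after y, or None when y is the last closure."""
    for p in reversed(ORDER):  # from p_m down to p_1
        if p in y:
            continue
        candidate = oplus(y, p)
        # Feasibility (Eqn. 4): below p the candidate must agree with y.
        if all(q in y for q in candidate if ORDER.index(q) < ORDER.index(p)):
            return candidate
    return None


def all_closures():
    """Enumerate every intent in lectic order, starting from the closure of {}."""
    y = closure(())
    found = []
    while y is not None:
        found.append(y)
        y = next_closure(y)
    return found


if __name__ == "__main__":
    for index, intent in enumerate(all_closures(), start=1):
        print(index, "".join(sorted(intent)) or "{}")
```

Because each call keeps only the first feasible candidate, the loop runs once per concept; this is the behaviour that the distributed variants in Section 3 set out to parallelize.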

Pseudo code for NextClosure is described in Algorithms 1 and 2 as background
to our distributed approach. Algorithm 1 applies the closure operator on the
null attribute set and generates the first intent, $Y$, which is the base for
all subsequent formal concepts. New concepts are generated in turn by calling
Algorithm 2 and adding the resultant concepts to the set of formal concepts,
$\mathcal{F}$. As each candidate intent is extended with new attributes, the
last intent should be the complete set of attributes. This feature is used to
terminate the loop (in Line 2 of Algorithm 1). Algorithm 2 accepts the formal
context triple, $(O, P, I)$, and current intent, $Y$, as inputs. By convention,
the attribute set $P$ is sorted in descending order. The $\oplus$-operator
described in Eqn. (5) is applied to produce candidate formal concepts. The
concept feasibility condition, Eqn. (4), is used to verify whether a new
candidate should be added to the set of formal concepts, $\mathcal{F}$. The
approach taken in the CloseByOne algorithm is similar in spirit to the approach
taken by the NextClosure algorithm: CloseByOne generates new formal concepts
based on the concept(s) generated in the previous iteration and tests their
feasibility using the operator $<_{p_i}$. The crucial difference is that the
CloseByOne algorithm generates many concepts in each iteration. CloseByOne
terminates when there are no more concepts that satisfy Eqn. (4). In short,
NextClosure only finds the first feasible formal concept in each iteration
whereas CloseByOne potentially generates many. As a consequence, CloseByOne
requires far fewer iterations.

Table 3. Partitioned datasets $S_1$ and $S_2$ derived from Table 1.
$S_1$ or $(O_{S_1}, P, I_{S_1})$        $S_2$ or $(O_{S_2}, P, I_{S_2})$

The appeal of NextClosure, and the explanation for our desire to make it more
efficient, lies in its thoroughness: the guarantee of a complete lattice
structure, which is a consequence of the main theorem of Formal Concept
Analysis [6]. This thoroughness is due to lectic ordering and the iterative
approach deployed by NextClosure; however, thoroughness comes at the cost of
high complexity. The advent of efficient mechanisms for dealing with iterative
algorithms using MapReduce, as captured by Twister, allows us to couple
NextClosure's thoroughness with a practical distributed implementation in this
paper.



3 Distributed Algorithms for Formal Concept Mining 

We continue by describing two methods for performing distributed NextClosure,
namely, MRGanter and MRGanter+. An introduction to Twister is deferred to
Section 4. We start by describing the properties of a partitioned dataset
compared to its unpartitioned form. In many cases these properties are simply
restatements of the properties of the derivation operators.

Given a dataset $S$, we partition its objects into $n$ subsets and distribute
the subsets over $n$ different nodes. Without loss of generality, it is
convenient to limit $n = 2$ here. We denote the partitions by $S_1$ and $S_2$.
Alternatively we can think in terms of formal contexts and write the formal
context, $(O, P, I)$, in terms of the partitioned formal contexts
$(O_{S_1}, P, I_{S_1})$ and $(O_{S_2}, P, I_{S_2})$. To fix ideas, we use the
dataset in Table 1 as an exemplar and generate the partitions in Table 3. The
partitions are non-overlapping: the intersection of the partitions is the null
set, $S_1 \cap S_2 = \emptyset$, and their union gives the full dataset,
$S = S_1 \cup S_2$. It follows that the partitions, $S_1$ and $S_2$, have the
same attribute set, $P$, as the entire dataset $S$; however, the set of objects
is different in each partition, i.e. $O_{S_1}$ and $O_{S_2}$.

Let $Y_S$, $Y_{S_1}$ and $Y_{S_2}$ denote an arbitrary attribute set $Y$ with
respect to the entire dataset $S$ and each of its partitions $S_1$ and $S_2$
respectively. By construction they are equivalent:
$Y_S = Y_{S_1} = Y_{S_2}$. Similarly, $Y'_{S_1}$, $Y'_{S_2}$ and $Y'_S$ are the
sets of objects derived by the derivation operation in each of the partitions
$S_1$, $S_2$ and the entire dataset $S$ respectively.

Property 1 Given the formal context, $(O, P, I)$, and the two partitions
$(O_{S_1}, P, I_{S_1})$ and $(O_{S_2}, P, I_{S_2})$, we have the property
$Y'_S = Y'_{S_1} \cup Y'_{S_2}$: the union of the sets of objects generated by
the derivation of the attribute sets $Y_{S_1}$ and $Y_{S_2}$ over the
partitions is equivalent to the set of objects generated by the derivation of
the attribute set $Y_S$ over the entire dataset, $S$.

Appealing to the definition of the derivation operator proposed by Wille in pQ,
the set $Y'_S$ is a subset of $O$, $Y'_S \subseteq O$. Moreover,
$Y'_{S_1} \subseteq O_{S_1}$ and $Y'_{S_2} \subseteq O_{S_2}$. Given
$S_1 \cup S_2 = S$ and $S_1 \cap S_2 = \emptyset$, it follows that
$O_{S_1} \cup O_{S_2} = O$ and $O_{S_1} \cap O_{S_2} = \emptyset$. Therefore,
$Y'_{S_1} \subseteq Y'_S$ and $Y'_{S_2} \subseteq Y'_S$. Finally,
$Y'_{S_1} \cup Y'_{S_2} \equiv Y'_S$. Conversely, an object $t$ that exists in
$Y'_S$, but not in $Y'_{S_1}$ or $Y'_{S_2}$, cannot exist because
$O_{S_1} \cup O_{S_2} = O$, $O_{S_1} \cap O_{S_2} = \emptyset$ and
$Y_S = Y_{S_1} = Y_{S_2}$. If $t$ is in $Y'_S$ it must appear in $Y'_{S_1}$ or
$Y'_{S_2}$. In short, Property 1 allows us to process all objects
independently: the objects can be distributed and processed in an arbitrary
order and this will not affect the result of $Y'$. Property 1 is trivially
extended to the case of $n$ partitions. Now we describe how formal concepts can
be combined from different partitions.
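
Property 1 can be checked exhaustively on a small context. The sketch below is illustrative: the helper names are hypothetical, the object rows are reconstructed from the single-object concepts of Table 2, and the split into objects 1 to 3 and objects 4 to 6 is an assumption inferred from the local closures reported later in Table 5.

```python
from itertools import combinations

ATTRIBUTES = "abcdefg"

# Assumed partitions: objects 1-3 in S1 and objects 4-6 in S2.
S1 = {1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg")}
S2 = {4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg")}
S = {**S1, **S2}  # the unpartitioned dataset


def common_objects(attrs, context):
    """Y' computed inside one context: objects having every attribute of Y."""
    wanted = frozenset(attrs)
    return frozenset(obj for obj, row in context.items() if wanted <= row)


def property_1_holds(attrs):
    """Check Y'_S == Y'_S1 | Y'_S2 for one attribute set Y."""
    local_union = common_objects(attrs, S1) | common_objects(attrs, S2)
    return local_union == common_objects(attrs, S)


def all_attribute_sets():
    """Every subset of P, from the empty set up to P itself."""
    for size in range(len(ATTRIBUTES) + 1):
        for subset in combinations(ATTRIBUTES, size):
            yield frozenset(subset)


if __name__ == "__main__":
    print(all(property_1_holds(y) for y in all_attribute_sets()))
    print(sorted(common_objects("bd", S1)), sorted(common_objects("bd", S2)))
```

The check only unions the per-partition answers, so the same test passes unchanged for any other disjoint split of the six objects.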

Property 2 Given the formal context, $(O, P, I)$, and the two partitions
$(O_{S_1}, P, I_{S_1})$ and $(O_{S_2}, P, I_{S_2})$, we have the property
$Y''_S = Y''_{S_1} \cap Y''_{S_2}$: the intersection of the closures of the
attribute sets $Y_{S_1}$ and $Y_{S_2}$ with respect to the partitions $S_1$ and
$S_2$ is equivalent to the closure of the attribute set $Y_S$ with respect to
the entire dataset $S$.

By the definition of the partition construction method above,
$S_1 \cup S_2 = S$, which implies that $S_1 \subseteq S$ and
$S_2 \subseteq S$. Recall that $Y'_{S_1} \subseteq Y'_S$ and
$Y'_{S_2} \subseteq Y'_S$, and from Property 1 we have that
$Y'_S = Y'_{S_1} \cup Y'_{S_2}$. Appealing to the properties of the derivation
operators, in pQ, we have $Y''_{S_1} \supseteq Y''_S$ and
$Y''_{S_2} \supseteq Y''_S$. It is clear that $Y''_{S_1}$ and $Y''_{S_2}$ need
not equal $Y''_S$, but by the definition of a closure
$(Y'_{S_1} \cup Y'_{S_2})' = (Y'_S)' = Y''_S$; thus,
$(Y'_{S_1} \cup Y'_{S_2})' = Y''_{S_1} \cap Y''_{S_2}$ follows trivially from
the definition of the derivation operators.

Example 2 Consider the following example of Property 2. Take the itemset
$Y = \{b, d\}$. We derive $Y''_{S_1} = \{b, d, f\}$ from the first partition
$S_1$, and $Y''_{S_2} = \{b, d, e\}$ from $S_2$. We derive
$Y''_S = \{b, d\}$ for the entire dataset $S$. Therefore
$Y''_S = Y''_{S_1} \cap Y''_{S_2}$.

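Theorem 1 Given a set of attributes $Y$, $Y \subseteq P$, let
$\mathcal{F}^Y_{S_1}$ and $\mathcal{F}^Y_{S_2}$ be the sets of closures based
on $Y$ in relation to $S_1$ and $S_2$ respectively. Then the closure set of $Y$
in relation to $S$ can be calculated from
$\mathcal{F}^Y_S = \mathcal{F}^Y_{S_1} \cap \mathcal{F}^Y_{S_2}$.

This is simply a consequence of Property 2, as
$\mathcal{F}^Y_S = Y''_S = Y''_{S_1} \cap Y''_{S_2} = \mathcal{F}^Y_{S_1} \cap \mathcal{F}^Y_{S_2}$
and $Y_S = Y_{S_1} = Y_{S_2}$ by definition of the partition.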

Example 3 Consider again Example 2. Appealing to Theorem 1, the formal concept
with respect to the entire dataset is the intersection of the formal concepts
from each partition:
$\mathcal{F}^Y_S = \mathcal{F}^Y_{S_1} \cap \mathcal{F}^Y_{S_2} = \{b, d, f\} \cap \{b, d, e\} = \{b, d\}$.

We denote the $k$-th partition as $S_k$, where $k = 1, \ldots, n$, and then
propose:

Theorem 2 Given the closures $\mathcal{F}^Y_{S_1}, \ldots, \mathcal{F}^Y_{S_n}$
from $n$ disjoint partitions,
$\mathcal{F}^Y_S = \bigcap_{k=1}^{n} \mathcal{F}^Y_{S_k}$.

A trivial inductive argument establishes that Theorem 2 holds. Theorem 1 proves
the $n = 2$ case. Theorem 2 follows by recognizing that the dataset $S$ at the
$(k-1)$-th step of the proof can be thought of as consisting of two partitions
only, the partition $S_1 \cup \cdots \cup S_{k-1}$ and a second partition
$S_k$.
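
The merging rule of Theorems 1 and 2 can be sketched as an intersection of per-partition closures. The code is illustrative only: the helper names are hypothetical, the object rows are reconstructed from Table 2, and both splits used here are assumptions (the two-way split mirrors the one implied by Example 2). One detail worth noting is that a partition containing no object with all attributes of $Y$ contributes the full attribute set $P$, which is neutral under intersection.

```python
from itertools import combinations

ATTRIBUTES = frozenset("abcdefg")

# Assumed object rows, reconstructed from the single-object concepts in Table 2.
ROWS = {
    1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg"),
    4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg"),
}


def local_closure(attrs, partition):
    """Closure of Y inside one partition; equals P if no local object matches."""
    attrs = frozenset(attrs)
    result = ATTRIBUTES
    for row in partition.values():
        if attrs <= row:
            result &= row
    return result


def merged_closure(attrs, partitions):
    """Theorem 2: intersect the local closures obtained from every partition."""
    result = ATTRIBUTES
    for partition in partitions:
        result &= local_closure(attrs, partition)
    return result


def split(rows, groups):
    """Build disjoint partitions from groups of object identifiers."""
    return [{obj: rows[obj] for obj in group} for group in groups]


TWO_WAY = split(ROWS, [(1, 2, 3), (4, 5, 6)])      # assumed S1 and S2
THREE_WAY = split(ROWS, [(1, 4), (2, 6), (3, 5)])  # an arbitrary 3-way split

if __name__ == "__main__":
    print(sorted(local_closure("bd", TWO_WAY[0])),
          sorted(local_closure("bd", TWO_WAY[1])),
          sorted(merged_closure("bd", TWO_WAY)))
    subsets = [frozenset(c) for n in range(8)
               for c in combinations(sorted(ATTRIBUTES), n)]
    print(all(merged_closure(y, THREE_WAY) == local_closure(y, ROWS)
              for y in subsets))
```

Only intents travel between nodes in this scheme; the object rows never leave their partition, which is the property the MapReduce formulation relies on.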

Calling on nothing more complex than 1) the properties of the derivation
operators and 2) the construction of non-overlapping partitions, we leverage
Theorem 2 in order to apply MapReduce, specifically the Twister variant, to
calculate closures from an arbitrary number of distributed nodes, sure in the
knowledge that the thoroughness of NextClosure is preserved.

3.1 MRGanter 

In order to address the dataset size limitations imposed on NextClosure, owing
in particular to the complexity of the $\oplus$-operation, we propose to deploy
FCA across multiple nodes in order to reduce the computation time. We address
the problem of how to decompose NextClosure so that each sub-task can be
executed in parallel. In Algorithm 2 there were two stages involved in
computing NextClosure: 1) computing a new candidate closure, and 2) making a
judgement on whether to add it to the evaluated formal concepts. In MapReduce
parlance, computing a new candidate closure corresponds to the map phase, and
validating its feasibility corresponds to the reduce phase. For the purpose of
this discussion, we only calculate the intent of a formal concept. In practice,
we calculate an extent based on the intent and previous extent. The variables
and constants used in these algorithms are summarized in Table 4.



Table 4. Table of variables and constants for distributed FCA algorithms.

Variables/Constants   Description
p_i                   an attribute in P, where i = 1, ..., m
L_k                   the complete set of local closures in data partition k,
                      where k = 1, ..., n. It will be transferred from mapper
                      to reducer
l_i                   an intent in L_k which is derived from p_i
d                     the intent produced in the previous iteration
f                     the newly generated intent
G                     a container for storing newly generated intents


The main operation in the merging function is the intersection operator,
which is applied on the set of local closures L_k generated at each node.
Algorithm 3 gives the pseudo code for the merging function based on Theorem 2.
To describe the merging operation, we introduce the notation
$\#(l_i, f) = l_i \cap f$, which acts on two intents. The merging function is
deployed at the reduce phase and only processes the local closures derived
from the same attribute (Line 1).

The Map phase described in Algorithm 4 produces all local closures. The
output consists of the previous intent d and a set of local intents L_k. In
order to be used in the merging function, the attribute which was used to form
local closures should be recorded and passed, as Line 4 does. All pairs which
have the same key, d, will be sent to the same reducer. All local intents are
used to form global intents in the reduce phase.

Algorithm 3 Merging function

Input: p_i, L_k, f.
Output: f.
1: l_i ← the local closure in L_k in terms of p_i;
2: f ← #(l_i, f);
3: return f

Algorithm 4 Map: MRGanter

Input: d.
Output: (d, L_k).
1: for p_i from p_m down to p_1 do
2:   if p_i is not in d then
3:     l_i ← d ⊕ p_i;
4:     associate l_i with p_i;
5:     L_k ← L_k ∪ l_i;
6:   end if
7: end for
8: return (d, L_k);

Algorithm 5 Reduce: MRGanter

Input: (d, L_k).
Output: f.
1: for p_i in P do
2:   f ← initialize new intent;
3:   for k from 1 up to n do
4:     f ← merging(p_i, L_k, f);
5:   end for
6:   if d <_{p_i} f then
7:     break;
8:   else
9:     continue;
10:  end if
11: end for
12: return f

Algorithm 6 Reduce: MRGanter+

Input: (d, L_k).
Output: G.
1: H ← initialize a two-level hash table;
2: for p_i in P do
3:   f ← initialize new intent;
4:   for k from 1 up to n do
5:     f ← merging(p_i, L_k, f);
6:   end for
7:   if f is not in H then
8:     add f into H;
9:     add f into G;
10:  end if
11: end for
12: return G

Algorithm 5 accepts (d, L_k) from the k-th mapper (see Section 4), where
k = 1, ..., n. Only pairs which have the same key, d, are accepted by a
reducer. Line 4 generates a candidate closure f. This candidate is then
validated. Finally, the successful candidate is output as the global closure f.

Fig. 1 depicts the iterative flow of control of MRGanter; the lines marked
with "S" import static data from each partition, while the lines marked with
"D" configure each map with the previous closure. Each new closure is tested to
see if it is the last, i.e. it contains all attributes, P. If this condition is
not met MRGanter continues.

We present a worked example using the dataset in Table 3. Table 5 illustrates
a few results due to space limitations. In practice, MRGanter performs 20
iterations to determine all concepts.
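
The map and reduce phases of MRGanter can be simulated in a single process to follow the worked example. This is a sketch under stated assumptions, not the Twister implementation: the function names are hypothetical, the object rows are reconstructed from Table 2, the partitions are assumed to hold objects 1 to 3 and 4 to 6, and no messaging or task scheduling is modelled.

```python
ORDER = "abcdefg"  # a < b < ... < g, so plain character comparison follows the order
ATTRIBUTES = frozenset(ORDER)

# Assumed partitions S1 and S2 (objects 1-3 and 4-6).
PARTITIONS = [
    {1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg")},
    {4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg")},
]


def local_closure(attrs, partition):
    """Closure of attrs inside one partition (P if no local object matches)."""
    attrs = frozenset(attrs)
    result = ATTRIBUTES
    for row in partition.values():
        if attrs <= row:
            result &= row
    return result


def map_task(d, partition):
    """Algorithm 4: local closures of d (+) p_i, keyed by the attribute p_i."""
    local = {}
    for p in reversed(ORDER):
        if p not in d:
            seed = {q for q in d if q < p} | {p}
            local[p] = local_closure(seed, partition)
    return local


def reduce_task(d, mapped):
    """Algorithm 5: merge local closures per attribute, keep the first feasible one."""
    for p in reversed(ORDER):
        if p in d:
            continue
        f = ATTRIBUTES
        for local in mapped:           # merging function: intersection
            f &= local[p]
        if all(q in d for q in f if q < p):   # feasibility test d <_p f
            return f
    return None


def mrganter(partitions):
    """Iterate map and reduce until the intent containing all attributes appears."""
    d = ATTRIBUTES
    for partition in partitions:
        d &= local_closure((), partition)     # closure of the empty set
    intents, iterations = [d], 0
    while d != ATTRIBUTES:
        mapped = [map_task(d, partition) for partition in partitions]
        d = reduce_task(d, mapped)
        intents.append(d)
        iterations += 1
    return intents, iterations


if __name__ == "__main__":
    found, rounds = mrganter(PARTITIONS)
    print(len(found), rounds)
```

On this toy data the loop performs one map/reduce round per new concept, which is the cost that motivates the batch-oriented variant in Section 3.2.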



3.2 MRGanter+

NextClosure calculates closures in lectic ordering to ensure every concept
appears only once. This approach allows a single concept to be tested with the
closure validation condition during each iteration. This is efficient when the
algorithm runs on a single machine. For multi-machine computation, the extra
computation and redundancy resulting from keeping only one concept after each
iteration across many machines is costly. We modify NextClosure to reduce the
number of iterations and name the corresponding distributed algorithm
MRGanter+.

[Figure 1: diagram with Data Split 1 to Data Split n, Map tasks running
computeClosure(), a runMapReduce() driver step, emitted pairs
(atr_1, localClosure_1), ..., (atr_i, localClosure_i), and Reduce 1 to
Reduce n; edges are labeled S and D.]

Fig. 1. MRGanter work flow: static data is loaded at the start of the procedure
(labeled by S) and the dynamic data (closures produced during each iteration)
is passed and used in the next iteration (labeled by D).

Rather than using redundancy checking, we keep as many closures as possible in
each iteration; all closures are maintained and used to generate the next batch
of closures. To this end, we modify Algorithm 5; the Map algorithm remains the
same as in Algorithm 4. Algorithm 6 describes the Reduce task for MRGanter+.
The Reduce in MRGanter+ first merges local closures in Line 5 and then
recursively examines whether they already exist in the set of global formal
concepts H (Line 7). The set H is used to index and search for a specified
closure quickly, and it is designed as a two-level hash table to reduce its
costs. The first level is indexed by the head attribute of the closure, while
the second level is indexed by the length of the closure. The new closures are
stored in G. We present a running example based on the dataset in Table 3 for
the purpose of comparison. MRGanter+ produces many intents in each iteration.
New intents are kept if they are not already in H. Notably, MRGanter+ requires
3 iterations to mine all concepts.
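
The batch behaviour of MRGanter+ and its two-level hash table can be sketched in a single process as follows. The code is an illustration under assumptions: names are hypothetical, object rows are reconstructed from Table 2, the partitions are assumed to hold objects 1 to 3 and 4 to 6, and an iteration is counted only when it yields at least one new intent.

```python
ORDER = "abcdefg"
ATTRIBUTES = frozenset(ORDER)

# Assumed partitions S1 and S2 (objects 1-3 and 4-6).
PARTITIONS = [
    {1: frozenset("abdf"), 2: frozenset("aceg"), 3: frozenset("bcdfg")},
    {4: frozenset("bde"), 5: frozenset("adef"), 6: frozenset("bcfg")},
]


def local_closure(attrs, partition):
    """Closure of attrs inside one partition (P if no local object matches)."""
    attrs = frozenset(attrs)
    result = ATTRIBUTES
    for row in partition.values():
        if attrs <= row:
            result &= row
    return result


def map_task(d, partition):
    """Same map step as MRGanter: local closures of d (+) p_i keyed by p_i."""
    return {p: local_closure({q for q in d if q < p} | {p}, partition)
            for p in reversed(ORDER) if p not in d}


def remember(table, intent):
    """Insert into the two-level table (head attribute, then length).

    Returns True when the intent was not stored before.
    """
    head = min(intent) if intent else None
    bucket = table.setdefault(head, {}).setdefault(len(intent), set())
    if intent in bucket:
        return False
    bucket.add(intent)
    return True


def mrganter_plus(partitions):
    """Keep every new merged intent per round and expand all of them next round."""
    first = ATTRIBUTES
    for partition in partitions:
        first &= local_closure((), partition)
    table = {}
    remember(table, first)
    frontier, productive_rounds = [first], 0
    while frontier:
        new_intents = []
        for d in frontier:
            mapped = [map_task(d, partition) for partition in partitions]
            for p in mapped[0]:
                f = ATTRIBUTES
                for local in mapped:      # merging: intersect local closures
                    f &= local[p]
                if remember(table, f):
                    new_intents.append(f)
        if new_intents:
            productive_rounds += 1
        frontier = new_intents
    intents = {i for by_len in table.values() for b in by_len.values() for i in b}
    return intents, productive_rounds


if __name__ == "__main__":
    intents, rounds = mrganter_plus(PARTITIONS)
    print(len(intents), rounds)
```

Bucketing by head attribute and length means a membership test only compares the candidate against intents that could plausibly be equal to it, which is one way to read the cost argument given for H.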



4 Twister MapReduce 



The MapReduce framework adopts a divide-and-conquer strategy to deal with huge
datasets and is applicable to many classes of problems [25]. A large number of
computers, collectively referred to as a cluster, are used to run the algorithm
in a distributed way.

Table 5. MRGanter: In each iteration, only a single intent (marked *)
satisfies the condition.

d       p_i   l_i from S_1    l_i from S_2    f
{}      g     {c,g}           {b,c,f,g}       {c,g}
        f     {b,d,f}         {f}             {f} *
        e     {a,c,e,g}       {d,e}           {e}
        d     {b,d,f}         {d,e}           {d}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{f}     g     {b,c,d,f,g}     {b,c,f,g}       {b,c,f,g}
        e     {a,c,e,g}       {d,e}           {e} *
        d     {b,d,f}         {d,e}           {d}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{e}     g     {a,c,e,g}       {a,...,g}       {a,c,e,g}
        f     {a,...,g}       {a,d,e,f}       {a,d,e,f}
        d     {b,d,f}         {d,e}           {d} *
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{d}     g     {b,c,d,f,g}     {a,...,g}       {b,c,d,f,g}
        f     {b,d,f}         {a,d,e,f}       {d,f} *
        e     {a,...,g}       {d,e}           {d,e}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}

Table 6. MRGanter+: Many intents (bold) are maintained per iteration.

d       p_i   l_i from S_1    l_i from S_2    f
{}      g     {c,g}           {b,c,f,g}       {c,g}
        f     {b,d,f}         {f}             {f}
        e     {a,c,e,g}       {d,e}           {e}
        d     {b,d,f}         {d,e}           {d}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{c,g}   f     {b,c,d,f,g}     {b,c,f,g}       {b,c,f,g}
        e     {a,c,e,g}       {a,...,g}       {a,c,e,g}
        d     {b,c,d,f,g}     {a,...,g}       {b,c,d,f,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{f}     g     {b,c,d,f,g}     {b,c,f,g}       {b,c,f,g}
        e     {a,c,e,g}       {d,e}           {e}
        d     {b,d,f}         {d,e}           {d}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}
{e}     g     {a,c,e,g}       {a,...,g}       {a,c,e,g}
        f     {a,...,g}       {a,d,e,f}       {a,d,e,f}
        d     {b,d,f}         {d,e}           {d}
        c     {c,g}           {b,c,f,g}       {c,g}
        b     {b,d,f}         {b}             {b}
        a     {a}             {a,d,e,f}       {a}

MapReduce was inspired by the map and reduce functions commonly used in
functional programming, for example in Lisp. It was introduced by Google [24]
and then implemented by many companies (Google, Yahoo!) and organizations
(Twister, Apache). These implementations provide automatic parallelization and
distribution, fault-tolerance, I/O scheduling, status and monitoring. The only
demand made of the user is the formulation of the problem in terms of map and
reduce functions. We use the terminology mapper and reducer when we refer to
the map and reduce function respectively. The map function takes an input pair
and produces a set of intermediate key/value pairs. The MapReduce library
provides the ability to acquire input pairs from files or databases which are
stored in a distributed way. Additionally, it can group all intermediate values
associated with the same intermediate key I and pass them to the same reducer.
The reduce function accepts an intermediate key I and a set of values
associated with I. It merges these values to form a possibly smaller set of
values.
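
The map, group-by-key and reduce contract described above can be mimicked in memory with a few lines of Python. This is a generic sketch, not Twister's or Hadoop's API: `run_mapreduce`, `emit_attributes` and `count` are hypothetical names, and the two input splits reuse the assumed toy rows (reconstructed from Table 2) to count how many objects carry each attribute.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:                      # map phase
        for key, value in map_fn(item):
            groups[key].append(value)        # shuffle: same key -> same reducer
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_attributes(split):
    """Mapper: emit (attribute, 1) for every attribute of every row in a split."""
    for row in split:
        for attribute in row:
            yield attribute, 1


def count(key, values):
    """Reducer: merge all values of one key into a single number."""
    return sum(values)


# Two assumed data splits (rows of objects 1-3 and 4-6 of the toy context).
SPLITS = [
    ["abdf", "aceg", "bcdfg"],
    ["bde", "adef", "bcfg"],
]

if __name__ == "__main__":
    support = run_mapreduce(SPLITS, emit_attributes, count)
    for attribute in sorted(support):
        print(attribute, support[attribute])
```

Only `emit_attributes` and `count` are problem specific; the grouping step in the middle is what a MapReduce runtime provides on the user's behalf.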

Twister [7] was designed to enhance MapReduce's functionality by efficiently
supporting iterative algorithms. Twister uses a publish/subscribe messaging
infrastructure (we choose NaradaBrokering³) for communication and data
transfer, and introduces long running map/reduce tasks which can be re-used in
different iterations. These long running tasks, which last for the duration of
the entire computation, ensure that Twister avoids reading static data in each
execution of MapReduce; a considerable saving. For iterative algorithms,
Twister categorizes data as being either static or dynamic. Static data is the
distributed data in local machines. Dynamic data is typically the data produced
by the previous iteration. Twister's configure phase allows the specification
of where the mapper reads the static data. Calculation is performed cyclically
based upon the dynamic and static data.

Unlike Twister, Hadoop focuses on single-step MapReduce and lacks built-in
support for iterative programs. For iterative algorithms, Hadoop MapReduce
chains multiple jobs together: the output of one MapReduce task is used
as the input for the next MapReduce task⁴. This approach is suboptimal because
new map/reduce tasks are created afresh for every iteration, which
incurs considerable performance overhead.
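
The difference between the two execution styles can be made concrete with a deliberately simple cost model. The model and its parameters (task_setup, static_read, compute) are assumptions made for illustration; they are not measurements of Hadoop or Twister.

```python
def chained_jobs_cost(iterations, task_setup, static_read, compute):
    """Cost when every iteration creates new tasks and re-reads the static data."""
    return iterations * (task_setup + static_read + compute)


def long_running_cost(iterations, task_setup, static_read, compute):
    """Cost when tasks are created once and the static data is read once."""
    return task_setup + static_read + iterations * compute


# Hypothetical unit costs: the gap grows linearly with the number of iterations.
chained = chained_jobs_cost(iterations=10, task_setup=5, static_read=20, compute=1)
reused = long_running_cost(iterations=10, task_setup=5, static_read=20, compute=1)
```

Under this model the per-iteration overhead of the chained style is task_setup + static_read, so it matters most for algorithms that need many iterations.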

5 Evaluation 

We provide evidence of the effectiveness and scalability of our algorithms in this
section. Subsection 5.1 describes the experimental environment and the
characteristics of the datasets used to validate performance in this work. In
Subsection 5.2 we describe our experimental results.

5.1 Test Environment and Datasets 

MRGanter and MRGanter+ are implemented in Java using the Twister runtime as
the distributed environment. In addition, a distributed version of CloseByOne
proposed by Krajca and Vychodil [10] is implemented under the Twister model
in order to provide a fair comparison for the algorithms proposed in the present
paper. For convenience, we name this algorithm MRCbo. To illustrate the
performance improvement of our distributed approach, we also evaluate the
centralized NextClosure and CloseByOne algorithms.

The experiments were run on the Amazon EC2 cloud computing platform.
We used High-CPU Medium Instances, each of which had 1.7 GB of memory, 5 EC2
Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of
local instance storage, and a 32-bit platform. We selected 3 datasets from the UCI
KDD machine learning repository, mushroom, anon-web, and census-income, for
this evaluation⁵. These datasets have 8124, 32711 and 103950 records and 125, 294
and 133 attributes respectively. We used the percentage of 1s to measure the dataset
density (see row 4 in Table 7). CPU time was used as the metric for comparing the
performance of the algorithms. The number of iterations used by each algorithm
was also recorded (see Table 9).
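
The density measure, the percentage of 1s in the binary object-attribute table, can be computed as in the following sketch. The function name density and the toy context are illustrative assumptions.

```python
def density(context):
    """Percentage of cells equal to 1 in a binary object-attribute table."""
    cells = sum(len(row) for row in context)
    if cells == 0:
        return 0.0  # an empty table has no 1s
    ones = sum(sum(row) for row in context)
    return 100.0 * ones / cells


# Hypothetical context: 2 objects, 4 attributes, two cells set to 1.
toy_context = [[1, 0, 0, 1], [0, 0, 0, 0]]
toy_density = density(toy_context)
```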

3 http://www.naradabrokering.org/

4 http://hadooptutorial.wikispaces.com/Iterative+MapReduce+and+Counters

5 http://archive.ics.uci.edu/ml/index.html

Table 7. UCI dataset characteristics: the number of objects, the number of
attributes, and the density.

Dataset      mushroom   anon-web   census-income
objects      8124       32711      103950
attributes   125        294        133
density      17.36%     1.03%      6.7%


Table 8. Execution time (in seconds) for each algorithm on the datasets. 



Dataset       mushroom          anon-web          census-income
concepts      219010            129009            96531
NextClosure   618               14671             18230
CloseByOne    2543              656               7465
MRGanter      20269 (5 nodes)   20110 (3 nodes)   9654 (11 nodes)
MRCbo         241 (11 nodes)    693 (11 nodes)    803 (11 nodes)
MRGanter+     198 (9 nodes)     496 (9 nodes)     358 (11 nodes)


5.2 Results and Analysis 

In Table 8 we present the best test results for the centralized algorithms,
NextClosure and CloseByOne, and the distributed algorithms, MRGanter, MRCbo
and MRGanter+. In short, MRGanter+ has the best overall
performance on the mushroom, anon-web and census-income datasets when 9, 9 and
11 nodes are used respectively. In comparison with NextClosure, MRGanter+
saves 68%, 96.6% and 98% in time when processing the mushroom, anon-web and
census-income datasets respectively. According to Table 8, MRGanter+ runs up
to 102 times faster than MRGanter (on mushroom) and 1.4 times faster than
MRCbo (on anon-web). MRCbo can run much faster than CloseByOne when 11 nodes
are used: it achieves a 90.5% saving in time on the mushroom
dataset compared with CloseByOne, but there is little difference on the
anon-web dataset. MRGanter takes the longest time to calculate the
formal concepts for both the mushroom and anon-web datasets; it is much slower
than even the centralized version, NextClosure. The census-income dataset is an
exception, because MRGanter takes roughly half the time of NextClosure with 11
nodes. Among the MR* algorithms and centralized algorithms, MRGanter+ achieved
the best performance.
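
The percentages and ratios quoted above follow from the execution times in Table 8. The short sketch below recomputes them; the helper names time_saving and speedup are illustrative, and the numbers are copied from Table 8.

```python
# Execution times in seconds from Table 8 (mushroom, anon-web, census-income).
TIMES = {
    "NextClosure": (618, 14671, 18230),
    "CloseByOne": (2543, 656, 7465),
    "MRGanter": (20269, 20110, 9654),
    "MRCbo": (241, 693, 803),
    "MRGanter+": (198, 496, 358),
}


def time_saving(baseline, improved):
    """Percentage of the baseline time that is saved by the improved run."""
    return 100.0 * (baseline - improved) / baseline


def speedup(baseline, improved):
    """How many times faster the improved run is than the baseline."""
    return baseline / improved


# Savings of MRGanter+ over NextClosure, one value per dataset.
savings = [
    round(time_saving(base, ours), 1)
    for base, ours in zip(TIMES["NextClosure"], TIMES["MRGanter+"])
]
# Saving of MRCbo over CloseByOne on mushroom.
mrcbo_saving = round(time_saving(TIMES["CloseByOne"][0], TIMES["MRCbo"][0]), 1)
# Speedups of MRGanter+ over MRGanter (mushroom) and over MRCbo (anon-web).
over_mrganter = round(speedup(TIMES["MRGanter"][0], TIMES["MRGanter+"][0]))
over_mrcbo = round(speedup(TIMES["MRCbo"][1], TIMES["MRGanter+"][1]), 1)
```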

For a deeper analysis, let us take scalability into account. We tested the MR*
algorithms on a range of nodes and plotted curves for each of them to show
the ability of the algorithms to decrease computation time by utilizing more
computers, as indicated in Figs. 2, 3 and 4 for the different datasets.

In Fig. 2, MRCbo is slower than MRGanter+, although its curve decreases
faster than that of MRGanter+ when we increase the number of nodes. The execution
time of MRGanter+ is low even on a single node and keeps decreasing
up to the maximum number of nodes, 11. The performance of MRGanter
does not benefit from increasing the number of nodes. One explanation
for this is the overhead incurred by distributing the computation, in particular
network communication overhead. This is markedly different from MRGanter+,
because MRGanter+ produces substantially more intermediate data than MRGanter
and MRCbo. Secondly, there is additional computation involved in the
distributed algorithms in comparison with the centralized versions of these
algorithms; consider, for instance, the extra work required by the merging
operation. The best number of nodes, where best refers to execution speed,
depends on the characteristics of the dataset.

Fig. 2. Mushroom dataset: comparison of MRGanter+, MRCbo and MRGanter.
MRGanter+ outperforms MRCbo and MRGanter when dense data is processed.

Fig. 3. Anon-web dataset: comparison of MRGanter+, MRCbo and MRGanter.
MRGanter+ is faster when more than 3 nodes are used.

Fig. 3 demonstrates that MRGanter+ outperforms MRGanter for the anon-web
dataset. One reason for this performance improvement is that MRGanter+
produces more concepts during each iteration than MRGanter. Table 9 indicates
that MRGanter+ requires 12, 11 and 9 iterations for the three datasets, whereas
MRGanter requires 219010, 129009 and 96531 iterations to obtain all concepts.
These additional iterations incur higher network communication costs. Fig. 4
demonstrates that this is also the case for the census dataset. In addition, the
curves in Fig. 4 are steeper than the curves in Figs. 2 and 3. These figures give
evidence that the performance of the MR* algorithms is related to the size and
density of the data. Based on these results we posit that MR* algorithms scale
well for large and sparse datasets. This evidence suggests that MR* algorithms
may be a viable candidate tool for handling large datasets, particularly when it
is impractical to use a traditional centralized technique.
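
To see why an algorithm in the NextClosure lineage needs as many iterations as there are concepts, consider the following centralized sketch of the textbook NextClosure procedure. It is an illustration on a toy context, not the MRGanter implementation; the function names and the context are assumptions.

```python
def closure(attrs, context, n_attrs):
    """Return the attributes shared by every object that has all of attrs."""
    result = set(range(n_attrs))
    for row in context:
        if attrs <= row:
            result &= row
    return frozenset(result)


def next_closure(current, context, n_attrs):
    """Return the closed set that follows current in lectic order, or None."""
    for i in reversed(range(n_attrs)):
        if i in current:
            continue
        seed = {a for a in current if a < i} | {i}
        candidate = closure(seed, context, n_attrs)
        # Accept only if no attribute smaller than i was newly introduced.
        if all(a in current for a in candidate if a < i):
            return candidate
    return None


def all_intents(context, n_attrs):
    """Enumerate all concept intents; each loop pass yields exactly one."""
    intents = []
    iterations = 0
    current = closure(set(), context, n_attrs)
    while current is not None:
        intents.append(current)
        iterations += 1
        current = next_closure(current, context, n_attrs)
    return intents, iterations


# Hypothetical context: three objects over attributes 0, 1 and 2.
toy = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2})]
intents, iterations = all_intents(toy, 3)
```

Every pass of the loop in all_intents produces a single intent, so the iteration count equals the number of concepts. This matches the NextClosure and MRGanter rows of Table 9, whereas an algorithm that emits many concepts per round needs far fewer rounds.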

Classical formal concept computing methods usually act on, and have local
access to, the entire database. Network communication is the primary concern
when developing distributed FCA approaches: frequent requests to remote
databases incur significant time and resource costs. Performance improvements
of the algorithms proposed in this paper may potentially arise from preprocessing
the dataset so that it is partitioned in a more optimal manner.
One direction for improving these algorithms lies in making the partitions more
even, in terms of density, so that the complexity is distributed more equitably. We
also intend to extend these methods so that they reduce the size of the intermediate
data produced in each iteration. We propose to extend this empirical study in a
companion paper which examines algorithm performance on larger datasets.

Table 9. Number of iterations required for each of the three datasets.

Dataset       mushroom   anon-web   census-income
concepts      219010     129009     96531
NextClosure   219010     129009     96531
CloseByOne    14         11         11
MRGanter      219010     129009     96531
MRCbo         14         11         11
MRGanter+     12         11         9

Fig. 4. Census dataset: comparison of MRGanter+, MRCbo and MRGanter.
MRGanter+ is fastest when a large dataset is processed.



6 Conclusion 



In this paper we considered methods for extending the NextClosure FCA algorithm
to distributed datasets and discussed a formal description of this extension.
Two new distributed FCA algorithms, MRGanter and
MRGanter+, were proposed based on this discussion. Various implementation
aspects of these approaches were discussed based on an empirical evaluation of the
algorithms. These experiments demonstrated the advantages of our approach
and, in particular, the scalability of MRGanter+. By comparing MRGanter+ with
MRCbo and MRGanter, we found that the number of iterations significantly
impacts the performance of distributed FCA, a promising result. In future work
we hope to capitalize on this by improving the MR* methodology to reduce
the number of iterations of these approaches and thereby further reduce
computation time.